Efficient multi-frame motion estimation for video compression

ABSTRACT

There is disclosed a method of digital signal compression, coding and representation, and more particularly a video compression, coding and representation system that uses multi-frame motion estimation and includes both device and method aspects. The invention also provides a computer program product, such as a recording medium, carrying program instructions readable by a computing device to cause the computing device to carry out a method according to the invention.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 60/556,363, filed Mar. 26, 2004, which is hereby incorporated herein by reference in its entirety.

FIELD OF INVENTION

This invention relates generally to digital signal compression, coding and representation, and more particularly to a video compression, coding and representation system using multi-frame motion estimation and having both device and method aspects. It further relates to a computer program product, such as a recording medium, carrying program instructions readable by a computing device to cause the computing device to carry out a method according to the invention.

BACKGROUND OF THE INVENTION

Due to the huge size of raw digital video data (or image sequences), compression must be applied to such data so that they may be transmitted and stored. There have been many important video compression standards, including the ISO/IEC MPEG-1, MPEG-2, MPEG-4 standards and the ITU-T H.261, H.263, H.263+, H.263++, H.264 standards. The ISO/IEC MPEG-1/2/4 standards are used extensively by the entertainment industry to distribute movies and digital video broadcasts, including video compact disk or VCD (MPEG-1), digital video disk or digital versatile disk or DVD (MPEG-2), recordable DVD (MPEG-2), digital video broadcast or DVB (MPEG-2), video-on-demand or VOD (MPEG-2), high definition television or HDTV in the US (MPEG-2), etc. The later MPEG-4 is more advanced than MPEG-2 and can achieve high quality video at a lower bit rate, making it very suitable for video streaming over the internet, digital wireless networks (e.g. 3G networks), the multimedia messaging service (MMS standard from 3GPP), etc. MPEG-4 will be included in the next generation of DVD players. The ITU-T H.261/3/4 standards are designed for low-delay video phone and video conferencing systems. The early H.261 was designed to operate at bit rates of p*64 kbit/s, with p=1, 2, . . . , 31. The later H.263 is very successful and is widely used in modern day video conferencing systems, and in video streaming in broadband and wireless networks, including the multimedia messaging service (MMS) in 2.5G and 3G networks and beyond. The latest H.264 (also called MPEG-4 Version 10, or MPEG-4 AVC) is currently the state-of-the-art video compression standard. It is so powerful that MPEG decided to develop it jointly with ITU-T in the framework of the Joint Video Team (JVT). The new standard is called H.264 in ITU-T and is called MPEG-4 Advanced Video Coding (MPEG-4 AVC), or MPEG-4 Version 10, in MPEG. Based on H.264, a related standard called the Audio Visual Standard (AVS) is currently under development in China. Other related standards may be under development.

H.264 has superior objective and subjective video quality over MPEG-1/2/4 and H.261/3. The basic encoding algorithm of H.264 is similar to H.263 or MPEG-4, except that an integer 4×4 discrete cosine transform (DCT) is used instead of the traditional 8×8 DCT, and there are additional features including an intra prediction mode for I-frames, multiple block sizes and multiple reference frames for motion estimation/compensation, quarter-pixel accuracy for motion estimation, an in-loop deblocking filter, context adaptive binary arithmetic coding, etc.

In particular, H.264 allows the encoder to store five reference pictures for motion estimation (ME) and motion compensation (MC). However, the processing time increases linearly with the number of reference pictures used, because in a simple implementation motion estimation needs to be performed for each reference frame. This full selection process (the examination of all five reference frames) provides the best coding result, but the five-fold increase in computation is very costly.

SUMMARY OF THE INVENTION

The present invention seeks to provide new and useful multi-frame motion estimation techniques for any current frame in H.264 or MPEG-4 AVC or AVS or related video coding.

According to the present invention there is provided a method of matching a first image object (original current frame) against a number of reference image objects (reference frames), including: defining regions (e.g. non-overlapping rectangular blocks, or some expanded regions similar to those in the overlapping motion compensation in H.263) in the first image object (e.g. frame N); for each region in the first image object, defining the corresponding representative coordinate (e.g. upper left corner of a block within the current frame); for each region in the first image object, defining a search region (e.g. a search window of 3×4 in frame N-1, 5×8 in frame N-2, 7×12 in frame N-3 and so on) in each of the reference image objects, with different search regions of possibly different sizes; for each region in the first image and for a relative coordinate in the search region (a location of the search block within the search window) in each of the reference image objects, defining a relative coordinate (motion vector) with respect to the representative coordinate of the said region in the first image, and a mismatch measure (e.g. SAD or MSE or some rate-distortion cost function) between the said region in the first image and the corresponding region at the said relative location in the said reference image; for each region in the first image object, computing the mismatch measure at selected relative locations in the search regions of selected reference image objects and choosing one or more relative coordinates in the search regions of the selected reference image objects, followed possibly by an optional local search at integer-pixel and sub-pixel precision around the chosen relative coordinates; for each region in the first image object, constructing a motion-compensated region by considering regions (such regions may or may not be of the same size as the said region in the first image, e.g. they can be the expanded regions used in overlapped motion compensation in H.263) at the said selected relative coordinates in the said selected reference image objects and applying linear or nonlinear operations to combine them to form a motion-compensated region (this region may or may not be of the same size as the said region in the first image); combining all the said motion-compensated regions to form a second image object (predictive frame) of the same size as the first image object; and forming a third image object (residue frame) of the same size as the first image object by subtracting the second image object (predictive frame) from the first image object (original current frame).
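
By way of illustration only, the following Python sketch shows the basic steps just described: block matching against several reference frames, forming the predictive frame from the winning blocks, and subtracting it to obtain the residue frame. The function names (`sad`, `match_frames`), the SAD mismatch measure and the exhaustive integer-pixel search are simplifying assumptions made for clarity, not a definitive implementation of the claimed method, which also admits expanded regions, other mismatch measures and sub-pixel refinement.

```python
import numpy as np

def sad(block, candidate):
    """Sum of absolute differences between two equally sized blocks."""
    return int(np.abs(block.astype(int) - candidate.astype(int)).sum())

def match_frames(current, references, block=16, search=8):
    """For each block of `current`, pick the best (reference frame, motion
    vector) pair by exhaustive integer-pixel search, then build the
    predictive frame and the residue frame.  Frame dimensions are assumed
    to be multiples of the block size."""
    h, w = current.shape
    assert h % block == 0 and w % block == 0
    predictive = np.zeros_like(current)
    for y in range(0, h, block):
        for x in range(0, w, block):
            cur = current[y:y + block, x:x + block]
            best = (float("inf"), 0, (0, 0))       # (cost, frame index, motion vector)
            for r, ref in enumerate(references):
                for dy in range(-search, search + 1):
                    for dx in range(-search, search + 1):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy <= h - block and 0 <= xx <= w - block:
                            cost = sad(cur, ref[yy:yy + block, xx:xx + block])
                            if cost < best[0]:
                                best = (cost, r, (dy, dx))
            _, r, (dy, dx) = best
            predictive[y:y + block, x:x + block] = \
                references[r][y + dy:y + dy + block, x + dx:x + dx + block]
    residue = current.astype(int) - predictive.astype(int)
    return predictive, residue
```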

The present invention provides for the intelligent selection of the reference image objects, and the intelligent selection of the relative coordinates. Such intelligent selection allows us to perform the mismatch calculation on a greatly reduced number of candidate reference frames and candidate relative locations (motion vectors).

In a preferred embodiment a class is defined for each region in the first image object; for each region in the first image object, a class is defined for each of the reference image objects; for each region in the first image object, the selection of the reference image objects and the selection of relative coordinates in the said reference image objects for the computation of the mismatch measure depend on the class of the said region in the first image and the classes of the said reference image objects, and after the said mismatch computation, one or more relative coordinates in the search regions of the reference image objects are chosen.

In a preferred embodiment a class is defined for each region in the first image object; for each region in the first image object, a class is defined for each of the reference image objects; for each region in the first image object, among all the reference image objects with the same or similar class, only a limited number of (e.g. one) image objects is selected for the computation of the mismatch measure; depending on the relationship between the class of the said region in the first image object and the class of any said selected reference image object, a number of the relative coordinates in the search area of the said selected reference image object are selected for mismatch computation. Complete or partial mismatch computations are both possible. After the said mismatch computation, one or more relative coordinates in the search regions of the reference image objects are chosen. This may be followed by an optional local search at integer-pixel or sub-pixel precision. In one version of this embodiment the selected image object within a class is the image object that is closest in time within the class (e.g. the nearest past frame with the same class).

In a preferred embodiment of the invention, for each region in the first image object, a class is defined for each of the reference image objects (a class is defined for each reference frame); for each region in the first image object (e.g. frame N), a second image object (e.g. frame N-1) is defined and a class is defined for the second image object; for each region in the first image object, a class is defined for the said region according to the class of the second image object and the relative coordinate in the search area in the second image object achieving the least mismatch measure between the first image object and the second image object; for each region in the first image object, the selection of the reference image objects and the selection of relative coordinates in the said reference image objects for the computation of the mismatch measure depends on the class of the said region in the first image and the classes of the said reference image objects; and after the said mismatch computation, one or more relative coordinates in the search regions of the reference image objects are chosen. In one version of this embodiment the second image is the immediate past image object.

In a preferred embodiment of the invention, for each region in the first image object, a class is defined for each of the reference image objects; for each region in the first image object (e.g. frame N), a second image object (e.g. frame N-1) is defined and a class is defined for the second image object; a search region is defined in the second image and a number of relative coordinates in the search region are selected for mismatch computation. Complete or partial mismatch computations are both possible. For each region in the first image object, a class is defined for the said region according to the class of the second image object, and one or more of the selected relative coordinates in the search region in the second image object achieving the least mismatch measures between the first image object and the second image object. For each region in the first image object, the selection of the reference image objects and the selection of relative coordinates in the said reference image objects for the computation of the mismatch measure depends on the class of the said region in the first image and the classes of the said reference image objects. After the said mismatch computation, one or more relative coordinates in the search regions of the reference image objects are chosen. In a version of this embodiment the second image is the immediate past image object.

In a preferred embodiment of the invention, for each region in the first image object, a class is defined for each of the reference image objects; for each region in the first image object (e.g. frame N), a second image object (e.g. frame N-1) is defined and a class is defined for the second image object; a search region is defined in the second image and a number of relative coordinates in the search region are selected for mismatch computation. Complete or partial mismatch computations are both possible. For each region in the first image object, a class is defined according to the class of the second image object, and one or more of the selected relative coordinates in the search region in the second image object achieving the least mismatch measures between the first image object and the second image object. For each region in the first image object, among all the reference image objects with the same class, only one image object is selected for the computation of the mismatch measure. Depending on the relationship between the class of the said region in the first image object and the class of any said selected reference image object, a number of the relative coordinates in the search area of the said selected reference image object are selected for mismatch computation. Complete or partial mismatch computations are both possible. After the said mismatch measure computation, one or more relative coordinates in the search regions of the reference image objects are chosen. In a version of this embodiment the second image is the immediate past image object. In a version of this embodiment the selected image object is the image object that is closest in time within the class (e.g. the nearest past frame with the same class).

In an embodiment of the invention a class is defined for each region in the first image object; for each region in the first image object, and for each relative coordinate in the corresponding search region in any of the reference image objects, a class is defined for the said relative coordinate (a class is defined for each possible search point in each reference frame); for each region in the first image object, the selection of the reference image objects and the selection of relative coordinates in the said reference image objects for the computation of the mismatch measure depends on the class of the said region in the first image and the classes of the said relative coordinates in the said reference image objects. After the said mismatch computation, one or more relative coordinates in the search regions of the reference image objects are chosen.

In an embodiment of the invention, for each region in the first image object, and for each relative coordinate in the corresponding search region in any of the reference image objects, a class is defined for the said relative coordinate (a class is defined for each possible search point in each reference frame); for each region in the first image object (e.g. frame N), a second image object (e.g. frame N-1) is defined, a search region is defined in the second image object, and a class is defined for all the relative coordinates in the said search region in the second image object; a number of the relative coordinates in the search region are selected for mismatch computation. Complete or partial mismatch computations are both possible. For each region in the first image object, a class is defined according to (i) one or more relative coordinates in the search region in the second image object achieving the least mismatch measures between the first image object and the second image object, and (ii) the classes of the said relative coordinates in the second image object. For each region in the first image object, the selection of the reference image objects and the selection of relative coordinates in the said reference image objects for the computation of the mismatch measure depends on the class of the said region in the first image and the classes of the said relative coordinates in the said reference image objects. In one version of this embodiment the said second image is the immediate past image object.

In an embodiment of the invention, for each region in the first image object, and for each relative coordinate in the corresponding search region in any of the reference image objects, a class is defined for the said relative coordinate (a class is defined for each possible search point in each reference frame). For each region in the first image object (e.g. frame N), a second image object (e.g. frame N-1) is defined, a search region is defined in the second image object, and a class is defined for all the relative coordinates in the said search region in the second image object; for each region in the first image object, a class is defined according to (i) a number of relative coordinates in the search region in the second image object achieving the least mismatch measures between the first image object and the second image object, and (ii) the classes of the said relative coordinates in the second image object (more than one relative coordinate may be used); for each region in the first image object, among all the reference image objects with the same class, only one image object is selected for the computation of the mismatch measure. Depending on the relationship between the class of the said region in the first image object and the class of any said selected reference image object, a number of the relative coordinates in the search area of the said selected reference image object are selected for mismatch computation. Complete or partial mismatch computations are both possible. After the said mismatch computation, one or more relative coordinates in the search regions of the reference image objects are chosen. In one version of this embodiment the said second image is the immediate past image object. In one version of this embodiment the selected image object within a class is the image object that is closest in time within the class (e.g. the nearest past frame with the same class).

In some embodiments of the invention, the classification of regions can be updated or reset, because classification errors can occur in some classification methods and such errors can accumulate over time. The updating or resetting can be performed periodically or when certain situations occur. In some embodiments of the invention, the classes of the said region, of the said second image object and of the said reference image objects are updated, after the mismatch measure computation, depending on the behavior of the computed mismatch measure. In one embodiment, the classes can be pixel-location classes such as “integer location”, “half-pixel location”, “quarter-pixel location”, “⅛-pixel location” and “1/16-pixel location”. In this embodiment, the pixel location classification of regions in the first image object may be determined with respect to the pixel location classification of the regions in the second image object. Thus any error in the pixel location classification of the regions in the second image object can affect the accuracy of the classification in the first image object, which in turn can affect future image objects. In one embodiment, the updating or resetting can be triggered when there is a large reduction in the mismatch measure as the motion estimation changes from one pixel precision to another (e.g. from full integer-pixel to half-pixel motion, or from half-pixel to quarter-pixel, and so on). In one embodiment, when there is a large reduction in the mismatch measure as the motion changes from integer-pixel precision to half-pixel precision, the class of the selected reference image object may be updated or reset to “integer-pixel location”. In another embodiment, when there is a large reduction in the mismatch measure as the motion changes from half-pixel precision to quarter-pixel precision, the class of the selected reference image object may be updated or reset to “half-pixel location”; similarly, when there is a large reduction in the mismatch measure as the motion changes from quarter-pixel to ⅛-pixel or from ⅛-pixel to 1/16-pixel precision, the selected reference image object may be updated or reset to “quarter-pixel location” or “⅛-pixel location” respectively.

Viewed from another aspect the present invention provides an overall multi-frame motion estimation method in which single-frame motion estimation is first performed with respect to the most recent reference frame; if the best distortion is smaller than a threshold, the search terminates; otherwise, single-frame motion estimation is performed on the other reference frames in such a way that only one frame from each pixel-location class is examined, that frame being the most recent frame in the class. In the single-frame motion estimation, full search or fast search can be applied. In other words, complete or partial mismatch computations are both possible.
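
A minimal sketch of this overall flow is given below, under the assumption that two helpers exist: `single_frame_me`, returning the best (distortion, motion vector) pair for one reference frame, and `pixel_location_class`, returning the pixel-location class of the collocated region in a reference frame. Both names are illustrative only, not part of any standard.

```python
def multi_frame_me(current_block, reference_frames, threshold,
                   single_frame_me, pixel_location_class):
    """Search the most recent reference frame first; terminate early if the
    best distortion is below the threshold; otherwise examine only the most
    recent frame of each remaining pixel-location class.
    `reference_frames` is ordered from most recent (frame t-1) to oldest."""
    dist, mv = single_frame_me(current_block, reference_frames[0])
    best = (dist, mv, 0)
    if dist < threshold:
        return best                                   # early termination
    seen = {pixel_location_class(reference_frames[0], current_block)}
    for idx, ref in enumerate(reference_frames[1:], start=1):
        cls = pixel_location_class(ref, current_block)
        if cls in seen:
            continue                                  # only the most recent frame per class
        seen.add(cls)
        dist, mv = single_frame_me(current_block, ref)
        if dist < best[0]:
            best = (dist, mv, idx)
    return best                                       # (distortion, motion vector, frame index)
```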

In the above embodiments of the invention, the classes may be defined according to the pixel location, with classes such as “integer location”, “half-pixel location”, “quarter-pixel location”, “⅛-pixel location”, “1/16-pixel location”, etc. The classes may also be defined according to the type of regions (e.g. “smooth”, “edge”, “texture”, etc), the orientation of edges (e.g. “vertical edge”, “horizontal edge”, and edges with various angles of orientation or inclination, etc), the width of the edges (e.g. “1-pixel wide edge”, “2-pixel wide edge”, etc), etc, or even a combination of various attributes. Some of these classifications of the first image object may require a second image object, while others do not.

The embodiments of the invention given above are in terms of the current H.264 setting. While H.264 allows 5 reference frames, in fact any number of reference frames can be used. This invention can be applied with multiple block sizes, and the blocks do not necessarily need to be non-overlapping. The reference frames may be in the past or in the future. While only one of the reference frames is used in the above description, more than one frame can be used (e.g. a linear combination of several reference frames). While H.264 uses the discrete cosine transform, any discrete transform can be applied. Indeed, the use of “pixel location” classes in the decision of which reference frames to use, or not to use, can be extended to other decisions. The classification of situations according to “pixel location” due to horizontal and vertical translational motion can be generalized to classes due to translational and rotational motion in all possible directions. It can also be generalized to classes due to rigid or non-rigid objects at different spatial locations, orientations and scales. While video is a sequence of “frames”, which are 2-dimensional pictures of the world, the invention can be applied to sequences of lower (e.g. 1) or higher (e.g. 3) dimensional descriptions of the world.

For the video, one picture element (pixel) may have one or more components, such as the luminance component, the red, green, blue (RGB) components, the YUV components, the YCrCb components, the infra-red components, the X-ray or other components. Each component of a picture element is a symbol that can be represented as a number, which may be a natural number, an integer, a real number or even a complex number. In the case of natural numbers, they may be of 12-bit, 8-bit, or any other bit resolution. While the pixels in video are 2-dimensional samples with a rectangular sampling grid and uniform sampling period, the sampling grid does not need to be rectangular and the sampling period does not need to be uniform.

Moreover, the present invention in any of its aspects is applicable not only to the encoding of video, but also to correspondence estimation in the encoding of audio signals, speech signals, video signals, seismic signals, medical signals, etc.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the invention will now be described, by way of example only, with reference to the following figures, in which:

FIG. 1 illustrates examples of half-pixel motion and quarter-pixel motion.

FIG. 2 shows an example of the selection of frames t-1, t-2 and t-3 for motion estimation of frame t.

FIG. 3 shows the flow chart of the proposed method FMFME.

FIG. 4 shows the PSNR comparison between full search and the proposed FMFME for a test sequence called “Coastguard”.

DETAILED DESCRIPTION OF THE INVENTION

The present invention, at least in its preferred embodiments, provides a novel fast multi-frame selection scheme that requires significantly reduced computational cost while achieving similar visual quality and bit-rate to the full selection process. Instead of searching through all the possible reference frames, the proposed scheme tries to select only a few representative frames for motion estimation. This is very useful for real-time applications.

Accurate prediction can efficiently reduce the degree of error between the original image and the predicted image. For this reason, quarter-pixel translational motion estimation/compensation is adopted in H.264 for better compression performance. Between two consecutive frames, the object motion may be an integer-pixel motion, a half-pixel motion, a quarter-pixel motion, a ⅛-pixel motion, a 1/16-pixel motion, etc. Typically, sub-pixel motion estimation algorithms use interpolation to predict the sub-pixel shift of texture relative to the sampling grid.

Suppose an object has edges aligned perfectly with the sensor boundaries at a particular time instant, such that the object edge is clear and sharp. We will describe this object as having “integer-pixel location”. When the object undergoes an integer-pixel translational motion, the object will look exactly the same in the two consecutive frames, except that one is a translation of the other, and the moved object can be predicted perfectly by integer-pixel motion estimation.

When the object 100 undergoes a half-pixel motion, the edges may be blurred as shown in FIG. 1a. We will describe this object as having “half-pixel location”. The original zero-pixel-wide (sharp) object edge now becomes one pixel wide (blurred). The pixels at the blurred object edges 106 may have only half the intensity of the original object, which can lead to difficulty in motion estimation. In particular, the block on the right 102 in FIG. 1a can be predicted perfectly by block 100 using half-pixel motion estimation. However, block 100 cannot be predicted perfectly by block 102.

Similarly, when the object 100 undergoes a quarter-pixel motion, the edges may be blurred as shown in FIG. 1b. We will describe this object as having “quarter-pixel location”. The zero-pixel-wide (sharp) object edge becomes one pixel wide (blurred). The pixels at the blurred edges may have ¾ (108) or ¼ (110) of the intensity. Again, the blurred “quarter-pixel location” object 104 can be predicted perfectly from the sharp “integer-pixel location” object 100 using quarter-pixel motion estimation, but not vice versa.
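
The blurring described above can be illustrated numerically. The small example below assumes a simple two-tap (bilinear) interpolation rather than the longer interpolation filters actually specified in H.264, and shows how a sharp edge acquires a half-intensity or quarter-intensity pixel when it moves by half or a quarter of a pixel.

```python
import numpy as np

# One row of a sharp, integer-pixel-located edge (background 0, object 255).
row = np.array([0, 0, 0, 255, 255, 255], dtype=float)

def subpixel_shift(row, frac):
    """Resample `row` shifted by a fractional amount using two-tap
    (bilinear) interpolation; frac=0.5 models half-pixel motion and
    frac=0.25 quarter-pixel motion."""
    right = np.append(row[1:], row[-1])      # right neighbour, edge padded
    return (1 - frac) * row + frac * right

print(subpixel_shift(row, 0.5))    # [0. 0. 127.5 255. 255. 255.]  half-intensity edge pixel
print(subpixel_shift(row, 0.25))   # [0. 0. 63.75 255. 255. 255.]  quarter-intensity edge pixel
```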

In general, the objects can be classified into sub-pixel location classes, namely “integer-pixel location”, “half-pixel location” and “quarter-pixel location”. The edge (and probably texture) details in the three classes are different.

The advantage of multi-frame ME is that it can further reduce the temporal redundancy in the video sequences by considering more than one reference frame. The best match is usually found by minimizing the cost function:

J(m, λ_MOTION) = SAD(s, c(m)) + λ_MOTION · R(m − p)   (1)

with m = (m_x, m_y)^T being the motion vector, p = (p_x, p_y)^T being the prediction for the motion vector and λ_MOTION being the Lagrange multiplier. The term R(m − p) represents the bits used to encode the motion information and is obtained by table lookup. The SAD (Sum of Absolute Differences) is computed as:

SAD(s, c(m)) = Σ_{x=1, y=1}^{B, B} | s[x, y] − c[x − m_x, y − m_y] |   (2)

with B = 16, 8 or 4, s being the original video signal and c being the coded video signal.
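
For illustration, the following sketch evaluates the cost of equations (1) and (2) for one candidate motion vector. The helper `rate_of_mv_difference`, standing in for the table lookup that maps a motion-vector difference to its coding cost in bits, is an assumed name, not a standard API.

```python
import numpy as np

def sad(s_block, c_block):
    """Equation (2): sum of absolute differences over a BxB block."""
    return int(np.abs(s_block.astype(int) - c_block.astype(int)).sum())

def motion_cost(s_block, c_block, mv, pred_mv, lagrange, rate_of_mv_difference):
    """Equation (1): J(m, lambda) = SAD(s, c(m)) + lambda * R(m - p)."""
    mvd = (mv[0] - pred_mv[0], mv[1] - pred_mv[1])
    return sad(s_block, c_block) + lagrange * rate_of_mv_difference(mvd)
```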

There are basically two types of temporal redundancy that can be captured by using multiple reference frames but not the traditional single frame. The first type of redundancy is related to short-term memory. Suppose the current frame is frame t. Sometimes, objects may be distorted or absent in frame t-1, but well represented in frames t-2 to t-5. An example is the blinking of an eye, which is a very fast motion. If multiple reference frames are allowed, the motion estimation and compensation can be significantly better than with just a single reference frame. However, in our experiments, we observe that the previous reference frame (i.e. frame t-1) still has the highest probability of being selected among the five possible reference frames.

The second type is the sub-pixel movement of textures described above. Textures and objects with different versions of “sub-pixel location” (“integer-pixel”, “half-pixel” and “quarter-pixel”) may occur in successive video frames. In our experiments, we observe that there is a great tendency for the cost function to be especially small when the same shifted version of texture is used to do the motion estimation and compensation. In other words, the optimal reference frame tends to be the one with the same sub-pixel location as the current frame.

The more reference frames used, the higher the probability for the current image to find a reference frame with the same sub-pixel location. This is probably the reason why it has been suggested that sub-pixel motion compensation is more important in single-frame ME/MC than in multi-frame ME/MC. Nevertheless, multi-frame motion estimation together with sub-pixel accuracy motion estimation is a strong combination to tackle the sub-pixel movement of texture and edges. However, it is difficult to determine accurately the current sub-pixel location class a priori.

We also observe that when there is more than one frame with the same sub-pixel location, the one closer to the current frame is usually better. As a result, the motion estimation speed can be increased by estimating the type of shifted version of texture (“sub-pixel location”) in the frame buffer. For example, in FIG. 2, the black squares mark the collocated macroblocks in Frame t and the multiple reference frames.

So if the macroblocks at Frame t-1 and Frame t-2 show the same shifting characteristics, then we can drop the redundant one (Frame t-2) and reduce the motion estimation complexity.

Suppose we need to perform multi-frame motion estimation for frame t, as shown in FIG. 2. In the proposed fast multi-frame motion estimation (FMFME) algorithm, each macroblock (e.g., macroblock 200) in the reference frames (frames t-5 to t-1) is assumed to have a sub-pixel location.

Before performing motion estimation for the current macroblock (with black frame) in frame t (202), the motion vectors of all the collocated macroblocks in frames t-1 to t-5 will be examined. If two or more macroblocks have the same sub-pixel location, only one frame is enabled for motion estimation. The sub-pixel locations of two macroblocks are considered the same if both the x and y components of the sub-pixel location are equal. The process enables the frame with the smallest temporal distance (closest) to the current frame.
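
This enabling step might be sketched as follows, assuming each collocated reference macroblock already carries its sub-pixel location as an (x, y) pair in quarter-pixel units; the function name is illustrative only.

```python
def enable_reference_frames(subpel_locations):
    """`subpel_locations[i]` is the (x, y) sub-pixel location, in quarter-pixel
    units, of the collocated macroblock in frame t-(i+1) (index 0 is the frame
    closest to the current frame).  For each distinct sub-pixel location only
    the temporally closest frame is kept; the enabled frame indices are
    returned."""
    enabled, seen = [], set()
    for i, loc in enumerate(subpel_locations):
        if loc not in seen:          # "same" means both x and y components are equal
            seen.add(loc)
            enabled.append(i)
    return enabled

# Frames t-1 and t-2 share location (2, 0), so t-2 is dropped; so is the
# second frame at location (0, 0).
print(enable_reference_frames([(2, 0), (2, 0), (0, 0), (1, 3), (0, 0)]))  # [0, 2, 3]
```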

Motion estimation is then performed on the immediate past frame (i.e. frame t-1 (204)), as this is the most likely frame to be the best. After the motion estimation process, the sub-pixel motion vector obtained is added to the sub-pixel location of the “dominant” reference macroblock (the one with the maximum overlapping area) to obtain the sub-pixel location of the current macroblock. So for FIG. 2, if we assume frame t-3 (206) is chosen to be the reference frame for the black (current) macroblock in frame t, the sub-pixel location of the current macroblock in frame t is obtained by adding its sub-pixel motion vector to the sub-pixel position of the black macroblock in frame t-3.
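
This propagation of the sub-pixel location can be sketched as below, under the assumption that both the location and the motion vector are kept in quarter-pixel units and that only the fractional (sub-pixel) part of their sum is retained; the names are illustrative.

```python
def propagate_subpel_location(ref_location, motion_vector, units_per_pixel=4):
    """Add the sub-pixel motion vector to the dominant reference macroblock's
    sub-pixel location (both in quarter-pixel units) and keep only the
    fractional part, giving the current macroblock's sub-pixel location."""
    return tuple((r + m) % units_per_pixel
                 for r, m in zip(ref_location, motion_vector))

# A reference macroblock at half-pixel location (2, 0) plus a motion vector of
# (6, 1) quarter-pels yields location (0, 1): integer in x, quarter-pixel in y.
print(propagate_subpel_location((2, 0), (6, 1)))  # (0, 1)
```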

Several early termination checks are then performed. If the best SAD with respect to frame t-1 is smaller than a threshold T₁, we consider it “good enough” and skip the motion estimation for the other reference frames. If the current macroblock is sufficiently flat, such that there is no strong texture inside the macroblock, we also skip the multi-frame motion estimation, because multi-frame motion estimation in such cases tends to have little performance gain over single-frame motion estimation. To determine the flatness of the macroblock, a 4×4 Hadamard transform is performed for every 4×4 sub-block inside the macroblock. If all the AC coefficients of the 4×4 sub-blocks are equal to zero, we consider the macroblock to be sufficiently flat. Otherwise, motion estimation is performed on the enabled reference frames.
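
A minimal sketch of the flatness test, assuming a 16×16 luma macroblock held as a NumPy array and an unnormalised 4×4 Hadamard transform, is shown below.

```python
import numpy as np

# Unnormalised 4x4 Hadamard (Walsh-Hadamard) matrix.
H4 = np.array([[1,  1,  1,  1],
               [1,  1, -1, -1],
               [1, -1, -1,  1],
               [1, -1,  1, -1]])

def is_flat_macroblock(mb):
    """Return True if every 4x4 sub-block of the 16x16 macroblock `mb` has
    all-zero AC coefficients under the 4x4 Hadamard transform, i.e. the
    macroblock carries no strong texture and multi-frame ME can be skipped."""
    for y in range(0, 16, 4):
        for x in range(0, 16, 4):
            coeffs = H4 @ mb[y:y + 4, x:x + 4].astype(int) @ H4.T
            coeffs[0, 0] = 0                    # ignore the DC coefficient
            if np.any(coeffs != 0):
                return False
    return True

print(is_flat_macroblock(np.full((16, 16), 128)))  # True: a perfectly flat block
```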

There is also a sub-pixel location refreshing mechanism. Our sub-pixel location estimation of the macroblocks may not be 100% accurate, especially with potential error drifting over many frames. The accumulation of error can affect the performance of FMFME. Therefore, refreshing of the sub-pixel location is necessary. We observe in our experiments that, for texture at integer-pixel, half-pixel or quarter-pixel locations, only the texture at an integer-pixel location can give a large SAD reduction when it is interpolated/filtered into half-pixel type texture for half-pixel motion estimation. When this occurs, we reset the sub-pixel location of the reference macroblock to be integer-pixel, and update the current sub-pixel location if appropriate. This refreshing can help to suppress the propagation of error.

Here is how our refreshing mechanism works. During the motion estimation between the current and reference frames, the best SAD obtained using integer-pixel motion estimation is compared with that obtained using half-pixel motion estimation. If the half-pixel ME gives a 50% reduction in SAD compared with the integer-pixel ME, the sub-pixel location type of the reference macroblock will be updated to be integer-pixel. FIG. 3 shows the flow chart 300 of the proposed algorithm.
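
The refreshing rule can be sketched as follows; the 50% ratio and the string class labels are taken from the description above, while the function name is illustrative.

```python
def refresh_subpel_class(best_sad_integer, best_sad_half, ref_class):
    """If half-pixel motion estimation reduces the best SAD by 50% or more
    relative to integer-pixel motion estimation, reset the reference
    macroblock's sub-pixel location class to 'integer-pixel'; otherwise
    leave it unchanged."""
    if best_sad_half <= 0.5 * best_sad_integer:
        return "integer-pixel"
    return ref_class

print(refresh_subpel_class(2000, 800, "half-pixel"))   # 'integer-pixel' (reset)
print(refresh_subpel_class(2000, 1500, "half-pixel"))  # 'half-pixel' (unchanged)
```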

Experimental Results

The proposed FMFME was tested on four QCIF (176×144) sequences, “Akiyo”, “Coastguard”, “Stefan” and “Foreman”, with a constant quantization factor Qp=16 and T₁=512. The proposed scheme was implemented in the H.264 reference software TML9.0. PMVFAST was used for fast motion estimation.

Table 1 below gives the results of PSNR (peak signal-to-noise ratio), bit rate and complexity reduction of the proposed FMFME on the test sequences. The computational cost was measured using a performance analyzer in terms of the number of clock cycles. Compared with the full search (FS) multi-frame selection scheme in H.264, the proposed FMFME can reduce the computational cost by 53% (equivalent to the complexity of performing PMVFAST on 2.6 frames instead of 5 reference frames) with negligibly small PSNR degradation (0.02 dB) and fewer bits (0.10%) on average. FIG. 4 shows a graph 400 of the PSNR comparison for “Coastguard”.

TABLE 1
Comparison of PSNR, bitrate and complexity for H.264 and the proposed FMFME algorithm

Sequence    Method      Complexity (clock cycles)  PSNR (dB)  Bitrate
Coastguard  H.264 (FS)  120.66 × 10⁹               34.01      2472144
            FMFME        64.92 × 10⁹               34.00      2476936
            Saving       46.2%                     −0.01      −0.19%
Akiyo       H.264 (FS)  102.43 × 10⁹               38.23      287336
            FMFME        28.73 × 10⁹               38.23      287024
            Saving       72%                       0.00       1.1%
Stefan      H.264 (FS)  122.98 × 10⁹               34.01      4814040
            FMFME        71.32 × 10⁹               33.99      4838168
            Saving       42%                       −0.02      −0.5%
Foreman     H.264 (FS)  126.77 × 10⁹               35.65      1604520
            FMFME        71.35 × 10⁹               35.60      1617704
            Saving       43.64%                    −0.05      −0.82%

The fast motion estimation process of embodiments of the present invention is mainly targeted at fast, low-delay and low-cost software and hardware implementations of H.264, or MPEG-4 AVC, or AVS, or related video coding standards or methods. Possible applications include digital cameras, digital camcorders, digital video recorders, set-top boxes, personal digital assistants (PDA), multimedia-enabled cellular phones (2.5G, 3G, and beyond), video conferencing systems, video-on-demand systems, wireless LAN devices, bluetooth applications, web servers, video streaming servers in low or high bandwidth applications, video transcoders (converters from one format to another), and other visual communication systems, etc.

While several aspects of the present invention have been described and depicted herein, alternative aspects may be effected by those skilled in the art to accomplish the same objectives. Accordingly, it is intended by the appended claims to cover all such alternative aspects as fall within the true spirit and scope of the invention.

1.-48. (canceled)
 49. A method comprising: for a macroblock in a current video frame: classifying, by a computing device, one or more corresponding reference macroblocks from one or more reference video frames; and performing, by the computing device, one or more motion estimations for the macroblock with respect to the one or more reference macroblocks; and generating, by the computing device, a resulting motion vector from the one or more motion estimations; wherein, for one or more macroblocks from a same class, motion estimation is performed on fewer than all of the one or more macroblocks from the same class.
 50. The method of claim 49, wherein performing comprises, for one or more macroblocks from the same class, performing motion estimation only once.
 51. The method of claim 50, wherein performing comprises, for one or more macroblocks from the same class, performing motion estimation only for a macroblock in a most recent reference frame from the same class.
 52. The method of claim 49, wherein classifying macroblocks comprises classifying based at least in part on image and/or video features.
 53. The method of claim 52, wherein classifying based at least in part on image and/or video features comprises classifying based at least in part on pixel locations, region types, and/or edge features.
 54. The method of claim 52, wherein classifying based at least in part on pixel locations comprises classifying based at least in part on integer and sub-integer locations.
 55. The method of claim 54, wherein classifying based at least in part on sub-integer locations comprises classifying based at least in part on half-pixel and quarter-pixel locations.
 56. The method of claim 52, wherein classifying based at least in part on region types comprises classifying based at least in part on smooth, edge, and/or texture region types.
 57. The method of claim 52, wherein classifying based at least in part on edge features comprises classifying based at least in part on vertical edges, horizontal edges, and angled edges.
 58. The method of claim 52, wherein classifying based at least in part on edge features comprises classifying based at least in part on edge width.
 59. The method of claim 49, further comprising: performing, by the computing device, an initial motion estimation for the macroblock with respect to a corresponding reference macroblock from a most recent reference frame; and determining, by the computing device, whether the initial motion estimation is sufficient for encoding the macroblock.
 60. The method of claim 59, wherein determining whether the initial motion estimation is sufficient for encoding the macroblock comprises determining, by the computing device, whether a distortion based on the initial motion estimation is smaller than a threshold.
 61. The method of claim 59, wherein determining whether the initial motion estimation is sufficient for encoding the macroblock comprises determining, by the computing device, whether the macroblock in the video frame does not have a strong texture.
 62. The method of claim 59, further comprising, in response to determining that the motion estimation is sufficient, generating, by the computing device, a resulting motion vector from the motion estimation and performing no additional motion estimations.
 63. The method of claim 59, further comprising, in response to determining that the motion estimation is insufficient, performing, by the computing device, one or more additional motion estimations with respect to at least one other macroblock from another reference frame, and generating, by the computing device, a resulting motion vector from the additional motion estimation or estimations.
 64. The method of claim 49, wherein: classifying comprises classifying macroblocks based at least in part on pixel locations; and wherein the method further comprises, for at least one of the reference macroblocks, updating, by the computing device, respective pixel locations for the respective at least one macroblocks.
 65. The method of claim 64, wherein updating respective pixel locations comprises, for a reference macroblock: computing, by the computing device, a first distortion between the macroblock from the current video frame and the reference macroblock for an integer-pixel motion estimation; computing, by the computing device, a second distortion between the current macroblock and the reference macroblock for a half-pixel motion estimation; and in response to determining that the second distortion is lower than a threshold, wherein the threshold is based at least in part on the first distortion, updating, by the computing device, a sub-pixel location type for the reference macroblock to be integer-pixel.
 66. An article of manufacture including a computer-readable medium having instructions stored thereon configured to enable a computing device, in response to execution of the instructions by the computing device, to perform operations comprising: for a macroblock in a current video frame: classifying one or more corresponding reference macroblocks from one or more reference video frames; and performing one or more motion estimations for the macroblock with respect to the one or more reference macroblocks; and generating a resulting motion vector from the one or more motion estimations; wherein, for one or more macroblocks from a same pixel-location class, motion estimation is not performed for at least one of the macroblocks.
 67. The article of claim 66, wherein the operations further comprise: performing an initial motion estimation for the macroblock with respect to a corresponding reference macroblock from a most recent reference frame; determining whether the motion estimation is sufficient for encoding the macroblock; in response to determining that the motion estimation is sufficient, generating a resulting motion vector from the motion estimation and performing no additional motion estimations; and in response to determining that the motion estimation is insufficient, performing one or more additional motion estimations with respect to at least one other macroblock from another reference frame, and generating a resulting motion vector from the additional motion estimation or estimations.
 68. The article of claim 67, wherein determining whether the initial motion estimation is sufficient comprises determining whether a distortion based on the initial motion estimation is smaller than a threshold.
 69. The article of claim 67, wherein determining whether the initial motion estimation is sufficient comprises determining whether the macroblock in the video frame does not have a strong texture.
 70. The article of claim 66, wherein the operations further comprise, for at least one of the other macroblocks, updating respective pixel locations for the respective at least one macroblocks.
 71. The article of claim 70, wherein updating respective pixel locations comprises, for a reference macroblock: computing a first distortion between the macroblock from the current video frame and the reference macroblock for an integer-pixel motion estimation; computing a second distortion between the macroblock from the current video frame and the reference macroblock for a half-pixel motion estimation; and in response to determining that the second distortion is lower than a threshold, the threshold based at least in part on the first distortion, updating a sub-pixel location type for the reference macroblock to be integer-pixel.
 72. An article of manufacture including a computer-readable medium having instructions stored thereon configured to enable a computing device, in response to execution of the instructions by the computing device, to perform operations comprising, for a current macroblock in a video frame: performing, by the computing device, an initial motion estimation for the macroblock with respect to a corresponding most recent reference macroblock from a most recent reference frame; in response to determining that a distortion based on the initial motion estimation is smaller than a threshold, generating, by the computing device, a resulting motion vector from the initial motion estimation; in response to determining that a distortion based on the initial motion estimation is larger than the threshold: performing, by the computing device, one or more additional motion estimations with respect to at least one other macroblock from another reference frame, and generating a resulting motion vector from the additional motion estimations; and after performing one motion estimation for one or more reference frames with corresponding macroblocks from a same pixel-location class, terminating, by the computing device, the one or more additional motion estimations for the same pixel-location class.
 73. The method of claim 70, further comprising, for at least one of the other macroblocks, updating, by the computing device, respective pixel locations for the respective at least one macroblocks.